Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subtree: Fix handling of complex history #493

Open
wants to merge 7 commits into
base: maint
Choose a base branch
from

Conversation

tqc
Copy link

@tqc tqc commented Dec 16, 2019

Fixes several issues that could occur when running subtree split on large repos with more complex history.

  1. A merge commit could bypass the known start point of the subtree, which would cause the entire history to be processed recursively, leading to a stack overflow / segfault after reading a few hundred commits. Older commits are now explicitly recorded as irrelevant so that the recursive process can terminate on any mainline commit rather than only on subtree joins and initial commits.

  2. It is possible for a repo to contain subtrees that lack the metadata that is usually present in add/join commit messages (git-svn at least can produce such a structure). The new use/ignore/map commands allow the user to provide that information for any problematic commits.

  3. A mainline commit that does not contain the subtree folder could be erroneously identified as a subtree commit, which would add the entire mainline history to the subtree. Commits will now only be used as is if all their parents are already identified as subtree commits. While the new code can still be tripped up by unusual folder structures, the completely unambiguous solution turned out to involve a significant performance penalty, and the new ignore / use commands provide a workaround for that scenario.

Cc: Avery Pennarun apenwarr@gmail.com
cc: Ed Maste emaste@freebsd.org
cc: Johannes Schindelin Johannes.Schindelin@gmx.de

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 16, 2019

Welcome to GitGitGadget

Hi @tqc, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that this Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "commit:", and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the FreeNode IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions.

If you want to see what email(s) would be sent for a submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox ("raw") file corresponding to the mail you want to reply to from the Git mailing list. If you use GMail, you can upload that raw mbox file via:

curl -g --user "<EMailAddress>:<Password>" --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Freenode. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@dscho
Copy link
Member

dscho commented Dec 16, 2019

/allow

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 16, 2019

User tqc is now allowed to use GitGitGadget.

@tqc
Copy link
Author

tqc commented Dec 17, 2019

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 17, 2019

Preview email sent as pull.493.git.1576626351.gitgitgadget@gmail.com

@tqc tqc force-pushed the tqc/subtree branch 3 times, most recently from fa2cc8f to c0de1ce Compare December 20, 2019 04:40
@dscho
Copy link
Member

dscho commented Dec 21, 2019

@tqc c0de1ceab098953789369e22cbeae5c7f67b45c9 seems to be missing a commit message and a sign-off...

@tqc tqc changed the title subtree: Avoid unnecessary recursion subtree: Fix handling of complex history Dec 22, 2019
Copy link

@emaste emaste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, comment intended for the first patch in the series went here by accident.

@flip1995
Copy link

flip1995 commented May 7, 2020

Friendly ping @tqc. What would be necessary to get this PR ready for submisson? @rust-lang (rust-lang/rust-clippy#5565) is interested to move this forward, so if you need help completing this, we would be happy to support you or take over.

@dscho
Copy link
Member

dscho commented May 8, 2020

What would be necessary to get this PR ready for submisson?

Probably just one thing: the Git mailing list insists on continuing in lower-case after the subtree: prefix of the commit message, e.g. subtree: remove notree cache.

After that, I think it'll be /submit time (but GitGitGadget might point out other issues, I haven't looked in detail whether there are e.g. overly-long lines in the commit messages).

Also, the first comment (which will be used as a cover letter) mentions that documentation is missing, and I would also suggest to add a footer Cc: Avery Pennarun <apenwarr@gmail.com> to the first comment to notify the best reviewer for this.

@tqc
Copy link
Author

tqc commented May 9, 2020

Sorry guys, got hopelessly distracted between updating the docs and submitting the PR.

The tech piece is all done, I should be able to find time to do the final cleanup tomorrow

@tqc tqc force-pushed the tqc/subtree branch 3 times, most recently from b93871b to 19db9cf Compare May 11, 2020 03:45
@tqc
Copy link
Author

tqc commented May 11, 2020

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented May 11, 2020

Preview email sent as pull.493.git.1589175821.gitgitgadget@gmail.com

@tqc
Copy link
Author

tqc commented May 11, 2020

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented May 11, 2020

@emaste
Copy link

emaste commented Jun 16, 2020

GitHub UI fail, that (+ve review) was meant to apply to 74fa670

main=
sub=
local grep_format="^git-subtree-dir: $dir/*\$"
git log --reverse --grep="$grep_format" \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why the grep format isn't just expanded inline

Comment on lines 399 to 400
main=
sub=
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are main and sub?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An artifact of starting with a copy of find_existing splits that should be removed. Since you’re building on top of this, is it better to rewrite the history or fix in a new commit?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewriting history won't bother me, my WIP is going to need to be edited and rebased significantly anyway.

@@ -161,7 +162,7 @@ command="$1"
shift

case "$command" in
add|merge|pull)
add|merge|pull|map)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pedantic nit, might be useful to keep these in alpha order

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was assuming logical rather than alphabetical ordering, where at least in the docs the more arcane options go towards the end. Happy to make the internal bits alphabetical though.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fair, it so happens the existing options are both logically & alphabetically ordered. Not a significant comment either way

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
@tqc
Copy link
Author

tqc commented Oct 5, 2020

Rebased on current maint and updated with the minor fixes to find_mainline ref from Ed's review. Will submit the updated version if the tests all pass and no other minor fixes come up

@dscho
Copy link
Member

dscho commented Oct 6, 2020

Will submit the updated version if the tests all pass and no other minor fixes come up

Please ignore the vs-build failure: this has been fixed already (in e58e405, but that fix did not make it to maint yet, obviously).

@tqc
Copy link
Author

tqc commented Oct 6, 2020

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2020

Preview email sent as pull.493.v2.git.1602021022.gitgitgadget@gmail.com

@tqc
Copy link
Author

tqc commented Oct 6, 2020

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 6, 2020

Submitted as pull.493.v2.git.1602021913.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-493/tqc/tqc/subtree-v2

To fetch this version to local tag pr-493/tqc/tqc/subtree-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-493/tqc/tqc/subtree-v2

@@ -238,7 +238,7 @@ cache_miss () {
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ed Maste wrote (reply to this):

On Tue, 6 Oct 2020 at 18:05, Tom Clarkson via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Tom Clarkson <tom@tqclarkson.com>
>
> Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
Reviewed-by: Ed Maste <emaste@FreeBSD.org>

@@ -238,13 +238,13 @@ cache_miss () {
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> From: Tom Clarkson <tom@tqclarkson.com>
>
> Include recursion depth in debug logs so we can see when the recursion is
> getting out of hand.
>
> Making the cache handle null mappings correctly and adding older commits
> to the cache allows the recursive algorithm to terminate at any point on
> mainline rather than needing to reach either the add point or the initial
> commit.

Makes sense.

> diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
> index 9867718503..160bad95c1 100755
> --- a/contrib/subtree/git-subtree.sh
> +++ b/contrib/subtree/git-subtree.sh
> @@ -244,7 +244,7 @@ check_parents () {
>  	do
>  		if ! test -r "$cachedir/notree/$miss"
>  		then
> -			debug "  incorrect order: $miss"
> +			debug "  unprocessed parent commit: $miss ($indent)"

Without any context, it is hard to understand what the `$indent` variable
is supposed to mean, so it is unclear why we need to print it here.

I _guess_ it is the degree removed from the first-parent lineage?

In any case, it does not hurt here, so I trust that it is good to include
it in the debug output.

>  			process_split_commit "$miss" "" "$indent"
>  		fi
>  	done
> @@ -392,6 +392,24 @@ find_existing_splits () {
>  	done
>  }
>
> +find_mainline_ref () {
> +	debug "Looking for first split..."
> +	dir="$1"
> +	revs="$2"

The `git-subtree` script seems to rely on the `local` construct, using it
in plenty of other circumstances. How about using it here, too?

> +
> +	git log --reverse --grep="^git-subtree-dir: $dir/*\$" \
> +		--no-show-signature --pretty=format:'START %H%n%s%n%n%b%nEND%n' $revs |

Since all you are interested in is the `git-subtree-mainline:` trailer,
wouldn't a format like `%(trailers:key=git-subtree-mainline)` instead of
`START %H%n%s%n%n%b%nEND%n`?

See
https://git-scm.com/docs/git-log#Documentation/git-log.txt-emtrailersoptionsem
for more information about pretty formats.

BTW I am super unfamiliar with `git subtree`'s inner workings, and
therefore it would help me incredibly if the commit message talked a bit
about the commit message layout (with a particular eye on
`git-subtree-dir` and `git-subtree-mainline` which I guess are trailers
added by `git subtree`?)...

> +	while read a b junk
> +	do
> +		case "$a" in
> +		git-subtree-mainline:)
> +			echo "$b"
> +			return
> +			;;
> +		esac
> +	done
> +}
> +
>  copy_commit () {
>  	# We're going to set some environment vars here, so
>  	# do it in a subshell to get rid of them safely later
> @@ -646,9 +664,9 @@ process_split_commit () {
>
>  	progress "$revcount/$revmax ($createcount) [$extracount]"
>
> -	debug "Processing commit: $rev"
> +	debug "Processing commit: $rev ($indent)"
>  	exists=$(cache_get "$rev")
> -	if test -n "$exists"
> +	if test -z "$(cache_miss "$rev")"
>  	then
>  		debug "  prior: $exists"

I do not see the `exists` variable being used other than for the debug
statement. Maybe better something like this?

	debug "  prior found for $rev"

>  		return
> @@ -773,6 +791,17 @@ cmd_split () {
>
>  	unrevs="$(find_existing_splits "$dir" "$revs")"
>
> +	mainline="$(find_mainline_ref "$dir" "$revs")"
> +	if test -n "$mainline"
> +	then
> +		debug "Mainline $mainline predates subtree add"
> +		git rev-list --topo-order --skip=1 $mainline |
> +		while read rev
> +		do
> +			cache_set "$rev" ""

Ah, so they are not really "null mappings", but mapped to an empty string.
Makes sense. Maybe adjust the commit message?

> +		done || exit $?
> +	fi
> +
>  	# We can't restrict rev-list to only $dir here, because some of our
>  	# parents have the $dir contents the root, and those won't match.
>  	# (and rev-list --follow doesn't seem to solve this)
> --
> gitgitgadget
>
>

@@ -27,6 +27,7 @@ b,branch= create a new branch from the split subtree
ignore-joins ignore prior --rejoin commits
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> @@ -48,6 +49,7 @@ annotate=
>  squash=
>  message=
>  prefix=
> +clearcache=

It might be more consistent to call it `clear_cache` (i.e. with an
underscore), just like `ignore_joins`.

>
>  debug () {
>  	if test -n "$debug"
> @@ -131,6 +133,9 @@ do
>  	--no-rejoin)
>  		rejoin=
>  		;;
> +	--clear-cache)
> +		clearcache=1
> +		;;
>  	--ignore-joins)
>  		ignore_joins=1
>  		;;
> @@ -206,9 +211,13 @@ debug "opts: {$*}"
>  debug
>
>  cache_setup () {
> -	cachedir="$GIT_DIR/subtree-cache/$$"
> -	rm -rf "$cachedir" ||
> -		die "Can't delete old cachedir: $cachedir"
> +	cachedir="$GIT_DIR/subtree-cache/$prefix"

Excellent, the `prefix` should be "unique enough".

> +	if test -n "$clearcache"
> +	then
> +		debug "Clearing cache"
> +		rm -rf "$cachedir" ||
> +			die "Can't delete old cachedir: $cachedir"
> +	fi
>  	mkdir -p "$cachedir" ||
>  		die "Can't create new cachedir: $cachedir"
>  	mkdir -p "$cachedir/notree" ||
> @@ -266,6 +275,16 @@ cache_set () {
>  	echo "$newrev" >"$cachedir/$oldrev"
>  }
>
> +cache_set_if_unset () {
> +	oldrev="$1"
> +	newrev="$2"

`local`? ;-)

> +	if test -e "$cachedir/$oldrev"
> +	then
> +		return
> +	fi
> +	echo "$newrev" >"$cachedir/$oldrev"

So that directory contains commit mappings, a file for each mapped
revision.

Thinking back to patch 2/11, I am now no longer that sure that it makes
sense to fill it up with every commit in that commit range: performance
suffers when directories contain too many files.

For example, I had a case in the past where it took a minute just to
enumerate a directory, and even looking whether a file existed in that
directory was not exactly fun.

In any case, I would write it slightly shorter:

	test -e "$cachedir/$oldrev" ||
	echo "$newrev" >"$cachedir/$oldrev"

> +}
> +
>  rev_exists () {
>  	if git rev-parse "$1" >/dev/null 2>&1
>  	then
> @@ -375,13 +394,13 @@ find_existing_splits () {
>  			then
>  				# squash commits refer to a subtree
>  				debug "  Squash: $sq from $sub"
> -				cache_set "$sq" "$sub"
> +				cache_set_if_unset "$sq" "$sub"
>  			fi
>  			if test -n "$main" -a -n "$sub"
>  			then
>  				debug "  Prior: $main -> $sub"
> -				cache_set $main $sub
> -				cache_set $sub $sub
> +				cache_set_if_unset $main $sub
> +				cache_set_if_unset $sub $sub
>  				try_remove_previous "$main"
>  				try_remove_previous "$sub"
>  			fi
> @@ -688,6 +707,8 @@ process_split_commit () {
>  		if test -n "$newparents"
>  		then
>  			cache_set "$rev" "$rev"
> +		else
> +			cache_set "$rev" ""

Was this hunk intended to be snuck in here? I can understand the
s/cache_set/cache_set_if_unset/ changes, of course, but not this hunk.

>  		fi
>  		return
>  	fi
> @@ -785,7 +806,7 @@ cmd_split () {
>  			# the 'onto' history is already just the subdir, so
>  			# any parent we find there can be used verbatim
>  			debug "  cache: $rev"
> -			cache_set "$rev" "$rev"
> +			cache_set_if_unset "$rev" "$rev"
>  		done
>  	fi
>
> @@ -798,7 +819,7 @@ cmd_split () {
>  		git rev-list --topo-order --skip=1 $mainline |
>  		while read rev
>  		do
> -			cache_set "$rev" ""
> +			cache_set_if_unset "$rev" ""

Okay. A quite interesting question now would be: are there any callers of
`cache_set` left? If so, why?

Thanks,
Dscho

>  		done || exit $?
>  	fi
>
> --
> gitgitgadget
>
>

@@ -9,12 +9,15 @@ then
set -- -h
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> @@ -796,20 +810,60 @@ cmd_add_commit () {
>  }
>
>  cmd_map () {
> -	oldrev="$1"
> -	newrev="$2"
>
> -	if test -z "$oldrev"
> +	if test -z "$1"

I'd like to keep the nice name. Maybe if it is `local`, there is no longer
a need to replace `$oldrev` by `$1`?

>  	then
>  		die "You must provide a revision to map"
>  	fi
>
> +	oldrev=$(git rev-parse --revs-only "$1") || exit $?
> +	newrev=
> +
> +	if test -n "$2"
> +	then
> +		newrev=$(git rev-parse --revs-only "$2") || exit $?
> +	fi
> +

Would it not make more sense to validate the parameters before calling
`cmd_map`?

In any case, this strikes me like a subject for another commit.

Thanks,
Dscho

P.S.: I'll have to stop reviewing here for the moment, not sure whether
I'll come back to it later today or maybe tomorrow.


>  	cache_setup || exit $?
>  	cache_set "$oldrev" "$newrev"
>
>  	say "Mapped $oldrev => $newrev"
>  }
>
> +cmd_ignore () {
> +	revs=$(git rev-parse $default --revs-only "$@") || exit $?
> +	ensure_single_rev $revs
> +
> +	say "Ignoring $revs"
> +
> +	cache_setup || exit $?
> +
> +	git rev-list $revs |
> +	while read rev
> +	do
> +		cache_set "$rev" ""
> +	done
> +
> +	echo "$revs" >>"$cachedir/processed"
> +}
> +
> +cmd_use () {
> +	revs=$(git rev-parse $default --revs-only "$@") || exit $?
> +	ensure_single_rev $revs
> +
> +	say "Using existing subtree $revs"
> +
> +	cache_setup || exit $?
> +
> +	git rev-list $revs |
> +	while read rev
> +	do
> +		cache_set "$rev" "$rev"
> +	done
> +
> +	echo "$revs" >>"$cachedir/processed"
> +}
> +
>  cmd_split () {
>  	debug "Splitting $dir..."
>  	cache_setup || exit $?
> @@ -827,7 +881,7 @@ cmd_split () {
>  		done
>  	fi
>
> -	unrevs="$(find_existing_splits "$dir" "$revs")"
> +	unrevs="$(find_existing_splits "$dir" "$revs") $(exclude_processed_refs)"
>
>  	mainline="$(find_mainline_ref "$dir" "$revs")"
>  	if test -n "$mainline"
> --
> gitgitgadget
>
>

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2020

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Ed,

On Mon, 5 Oct 2020, Ed Maste wrote:

> On Mon, 5 Oct 2020 at 09:18, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > FWIW there have been more comments in favor of this patch in
> > https://github.com/gitgitgadget/git/pull/493.
> >
> > I guess what this patch series needs to proceed further is an ACK by
> > Avery?
>
> Avery says "if it's good enough for Johannes it's good enough for me"
> https://twitter.com/apenwarr/status/1313231132721401861

What Avery does not quite understand is that Johannes has not a lot of
clue about `git subtree`... ;-)

Seriously, I am not a user, unfamiliar with the implementation details
(although I am getting a bit more familiar through the review I started
and plan on finishing later). My biggest connection with `git subtree` is
that there seem to be a couple of Git for Windows users who actively use
that command (which is the reason why we include it in Git for Windows,
unlike many other things from `contrib/`).

Ciao,
Dscho

@@ -9,12 +9,15 @@ then
set -- -h
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> From: Tom Clarkson <tom@tqclarkson.com>
>
> Prevent a mainline commit without $dir being treated as a subtree
> commit and pulling in the entire mainline history. Any valid subtree
> commit will have only valid subtree commits as parents, which will be
> unchanged by check_parents.

I feel like this is only half the picture because I have a hard time
stitching these two sentences together.

After studying the code and your patch a bit, it appears to me that
`process_split_commit()` calls `check_parents()` first, which will call
`process_split_commit()` for all as yet unmapped parents. So basically, it
recurses until it found a commit all of whose parents are already mapped,
then permeates that information all the way back.

Doesn't this cause serious issues with stack overflows and all for long
commit histories?

> Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
> ---
>  contrib/subtree/git-subtree.sh | 24 +++++++++++-------------
>  1 file changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
> index e56621a986..fa6293b372 100755
> --- a/contrib/subtree/git-subtree.sh
> +++ b/contrib/subtree/git-subtree.sh
> @@ -224,8 +224,6 @@ cache_setup () {
>  	fi
>  	mkdir -p "$cachedir" ||
>  		die "Can't create new cachedir: $cachedir"
> -	mkdir -p "$cachedir/notree" ||
> -		die "Can't create new cachedir: $cachedir/notree"

It might make sense to talk about this a bit in the commit message.
Essentially, you are replacing the `notree/<rev>` files by mapping `<rev>`
to the empty string.

This makes me wonder, again, whether the file system layout of the cache
can hold up to the demands. If a main project were to merge a subtree
with, say, 10 million commits, wouldn't that mean that `git subtree` would
now fill one directory with 10 million files? I cannot imagine that this
performs well, still.

>  	debug "Using cachedir: $cachedir" >&2
>  }
>
> @@ -255,18 +253,11 @@ check_parents () {
>  	local indent=$(($2 + 1))
>  	for miss in $missed
>  	do
> -		if ! test -r "$cachedir/notree/$miss"
> -		then
> -			debug "  unprocessed parent commit: $miss ($indent)"
> -			process_split_commit "$miss" "" "$indent"
> -		fi
> +		debug "  unprocessed parent commit: $miss ($indent)"
> +		process_split_commit "$miss" "" "$indent"

That makes sense to me, as the `missed` variable only contains as yet
unmapped commits, therefore we do not have to have an equivalent `test -r`
check.

Ciao,
Dscho

>  	done
>  }
>
> -set_notree () {
> -	echo "1" > "$cachedir/notree/$1"
> -}
> -
>  cache_set () {
>  	oldrev="$1"
>  	newrev="$2"
> @@ -719,11 +710,18 @@ process_split_commit () {
>  	# vs. a mainline commit?  Does it matter?
>  	if test -z "$tree"
>  	then
> -		set_notree "$rev"
>  		if test -n "$newparents"
>  		then
> -			cache_set "$rev" "$rev"
> +			if test "$newparents" = "$parents"
> +			then
> +				# if all parents were subtrees, this can be a subtree commit
> +				cache_set "$rev" "$rev"
> +			else
> +				# a mainline commit with tree missing is equivalent to the initial commit
> +				cache_set "$rev" ""
> +			fi
>  		else
> +			# no parents with valid subtree mappings means a commit prior to subtree add
>  			cache_set "$rev" ""
>  		fi
>  		return
> --
> gitgitgadget
>
>

@@ -52,6 +52,12 @@ useful elsewhere, you can extract its entire history and publish
that as its own git repository, without accidentally
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> From: Tom Clarkson <tom@tqclarkson.com>
>
> Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
> ---
>  contrib/subtree/git-subtree.txt | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/contrib/subtree/git-subtree.txt b/contrib/subtree/git-subtree.txt
> index 352deda69d..a5a76e8ce6 100644
> --- a/contrib/subtree/git-subtree.txt
> +++ b/contrib/subtree/git-subtree.txt
> @@ -52,6 +52,12 @@ useful elsewhere, you can extract its entire history and publish
>  that as its own git repository, without accidentally
>  intermingling the history of your application project.
>
> +Although the relationship between subtree and mainline commits is stored

As far as I can see, this is the first time the term "mainline commit" is
used in that file, and it has not really be defined what you mean by that.
I *guess* you are referring to commits in the main project that did not
come from any subtree project.

Maybe this can be described without needing a new term?

Ciao,
Dscho

> +in regular git history, it is also cached between subtree runs. In most
> +cases this is merely a performance improvement, but for projects with
> +large and complex histories the cache can be manipulated directly
> +with the use, ignore and map commands.
> +
>  [TIP]
>  In order to keep your commit messages clean, we recommend that
>  people split their commits between the subtrees and the main
> @@ -120,6 +126,21 @@ and friends will work as expected.
>  Note that if you use '--squash' when you merge, you should usually not
>  just '--rejoin' when you split.
>
> +ignore::
> +	Mark a commit and all of its history as irrelevant to subtree split.
> +	In most cases this would be handled automatically based on metadata
> +	from subtree join commits. Intended for improving performance on
> +	extremely large repos and excluding complex history that turns out
> +	to be otherwise problematic.
> +
> +use::
> +	Mark a commit and all of its history as part of an existing subtree.
> +	In normal circumstances this would be handled based on the metadata
> +	from the subtree join commit. Similar to the --onto option of split.
> +
> +map::
> +	Manually override the normal output of split for a particular commit.
> +	Extreme flexibility for advanced troubleshooting purposes only.
>
>  OPTIONS
>  -------
> @@ -142,6 +163,9 @@ OPTIONS
>  	This option is only valid for add, merge and pull (unsure).
>  	Specify <message> as the commit message for the merge commit.
>
> +--clear-cache::
> +	Reset the subtree cache and recalculate all subtree mappings from the
> +	commit history
>
>  OPTIONS FOR add, merge, push, pull
>  ----------------------------------
> --
> gitgitgadget
>

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 7, 2020

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> Fixes several issues that could occur when running subtree split on large
> repos with more complex history.
>
>  1. A merge commit could bypass the known start point of the subtree, which
>     would cause the entire history to be processed recursively, leading to a
>     stack overflow / segfault after reading a few hundred commits. Older
>     commits are now explicitly recorded as irrelevant so that the recursive
>     process can terminate on any mainline commit rather than only on subtree
>     joins and initial commits.
>
>
>  2. It is possible for a repo to contain subtrees that lack the metadata
>     that is usually present in add/join commit messages (git-svn at least
>     can produce such a structure). The new use/ignore/map commands allow the
>     user to provide that information for any problematic commits.
>
>
>  3. A mainline commit that does not contain the subtree folder could be
>     erroneously identified as a subtree commit, which would add the entire
>     mainline history to the subtree. Commits will now only be used as is if
>     all their parents are already identified as subtree commits. While the
>     new code can still be tripped up by unusual folder structures, the
>     completely unambiguous solution turned out to involve a significant
>     performance penalty, and the new ignore / use commands provide a
>     workaround for that scenario.

I gave this as thorough a review as I can (which is not saying too much,
as I am not exactly familiar with `git subtree`'s inner workings).

Hopefully some of my comments and suggestions are helpful.

At some stage, especially given the problems I pointed out with the
implementation detail that is a flat directory with a potentially insane
number of files in it, I think it would make a lot of sense to go ahead
and turn this into a built-in Git command, implemented in C, and with a
more robust file system layout of its cache.

Ciao,
Dscho

@Manishearth
Copy link

(Unfortunately my git mailing list subscription is on my inbox from my previous job, so I can't reply directly just yet, though I'm working on it)

@tqc Do you know what's blocking this from landing? Is it something that others can help with? For context, we're using subtree in the Rust project and the repo reliably crashes the shipped tool, and we would like for this patch to land as soon as it can.

@dscho
Copy link
Member

dscho commented Feb 18, 2021

(Unfortunately my git mailing list subscription is on my inbox from my previous job, so I can't reply directly just yet, though I'm working on it)

You don't need to be subscribed to post. You only need to be careful to send plain text mail, not even an alternate HTML or it will bounce. For details, see: https://github.com/gitgitgadget/gitgitgadget/wiki/ReplyToThis

@jzelinskie
Copy link

I was experiencing occasional segfaults on split and having split exit-1 consistently with the non-squashed subtrees on our relatively small repo (2.5k commits, with 3 subtrees each <100 commits). One offending subtree even has absolutely no commits to split -- git log reveals only the commit where the subtree was added. I couldn't reproduce the issue on a brand new repository which made it sound like the first condition described in this change. After applying this change locally, everything works fine.

It's fine that I temporarily take point with my patched git to manage anything where we need to perform splits, but I'd like my teammates to be able learn how to manage subtrees as well without bumping into this bug (which almost turned us away from subtrees altogether).

@dscho
Copy link
Member

dscho commented Jun 15, 2021

@jzelinskie this is currently stuck in @tqc's court (I provided the prerequisite review on the Git mailing list, but haven't seen any activity in response).

@sandercox
Copy link

For what it's worth. I've just used the changes on this PR to get subtree data from our repository and push it back upstream. This version is much faster and actually delivers, the original in 2.32 was just failing with Maximum function recursion depth (1000) reached and git-bash on windows had stranger errors about git not being available or something.

@klardotsh
Copy link

I also used this to detangle my company's changes to a library we had stored in a complicated monorepo. Worked great, where mainline git subtree segfaulted pretty quickly. I don't recall running into any issues once I had this built and installed in a throwaway Docker container.

@RoSk0
Copy link

RoSk0 commented Mar 22, 2022

I reported the same issue described in original thread which I found only after.

I've used git-filter-repo as a work around and it works great and way faster then original subtree split. Haven't compared with implementation from here though.

@RalfJung
Copy link

@jzelinskie this is currently stuck in @tqc's court (I provided the prerequisite review on the Git mailing list, but haven't seen any activity in response).

Doesn't look like @tqc is still following here, so I guess now the question is whether someone else is willing and able to pick this up?

None of the subtrees used in the Rust repository would have any chance of actually working without this patch. So all the people maintaining those subtrees have had to use a forked git for years, which is not a great situation.

@dscho
Copy link
Member

dscho commented Jul 25, 2022

I guess now the question is whether someone else is willing and able to pick this up?

@RalfJung This is also my guess.

However, this patch series has languished for so long that it will be quite challenging: I just gave this a try, and after an hour of rebasing and resolving merge conflicts and noticing that there are no added regression tests to build confidence in the patches, I have to shift my attention elsewhere.

If anyone wants to pick up from where I left off: here is the rebased branch, and here are the reviews I provided on the Git mailing list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet